Design considerations and text selection for BREF, a large French read-speech corpus
نویسندگان
چکیده
BREF, a large read-speech corpus in French has been designed with several aims: to provide enough speech data to develop dictation machines, to provide data for evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a corpus of continuous speech to study phonological variations. This paper presents some of the design considerations of BREF, focusing on the text analysis and the selection of text materials. The texts to be read were selected from 4.6 million words of the French newspaper, Le Monde. In total, 11,000 texts were selected, with an emphasis on maximizing the number of distinct triphones. Separate text materials were selected for training and test corpora. The goal is to obtain about 10,000 words (approximately 60-70 min.) of speech from each of 100 speakers, from different French dialects.
منابع مشابه
BREF, a large vocabulary spoken corpus for French
This paper presents some of the design considerations of BREF, a large read-speech corpus for French. BREF was designed to provide continuous speech data for the development of dictation machines, for the evaluation of continuous speech recognition systems (both speaker-dependent and speakerindependent), and for the study of phonological variations. The texts to be read were selected from 5 mil...
متن کاملPronunciation Variants Across Systems, Languages and Speaking Style
This contribution aims at evaluating the use of pronunciation variants across different system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription and a phonemically represented lexicon, thus focusing on the modeling abilities of the acoustic word models. Parallel and sequential variants are tes...
متن کاملSpeaker-Independent Phone Recognition Using BREF
A series of experiments on speaker-independent phone recognition of continuous speech have been carried out using the recently recorded BREF corpus. These experiments are the first to use this large corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent phone recognition of French. The HMM-based recognizer was trained with hand-verified data from 43 speake...
متن کاملIssues in Large Vocabulary, Multilingual Speech Recognition
In this paper we report on our activities in multilingual, speaker-independent,large vocabulary continuous speech recognition. The multilingual aspect of this work is of particular importance in Eu-rope, where each country has its own national language. Our existing recognizer for American English and French, has been ported to British English and German. It has been assessed in the context of ...
متن کاملTranscribing broadcast news shows
While significant improvements have been made over the last 5 years in large vocabulary continuous speech recognition of large read-speech corpora such as the ARPA Wall Street Journal-based CSR corpus (WSJ) for American English and the BREF corpus for French, these tasks remain relatively artificial. In this paper we report on our development work in moving from laboratory read speech data to r...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1990